Abstract

Internet mapping projects generally consist in sampling the network from a limited set 
of sources by using traceroute probes. This methodology, akin to the merging of spanning 
trees from the different sources to a set of destinations, leads necessarily to a partial, incomplete 
map of the Internet. Accordingly, determination of Internet topology characteristics from such 
sampled maps is in part a problem of statistical inference. Our contribution begins with the 
observation that the inference of many of the most basic topological quantities - including 
network size and degree characteristics - from traceroute measurements is in fact a version 
of the so-called 'species problem' in statistics. This observation has important implications, 
since species problems are often quite challenging. We focus here on the most fundamental 
example of a traceroute internet species: the number of nodes in a network. Specifically,
we characterize the difficulty of estimating this quantity through a set of analytical arguments,
we use statistical subsampling principles to derive two proposed estimators, and we illustrate 
the performance of these estimators on networks with various topological characteristics. 

Keywords: Internet sampling, species problem, statistical inference. 

1 Introduction 

A significant research and technical challenge in the study of large information networks is related 
to the incomplete character of the corresponding maps, usually obtained through some sampling 
process. A prototypical example of this situation is faced in the case of the physical Internet. 
The topology of the Internet can be investigated at different granularity levels such as the router 
and Autonomous System (AS) level, with the final aim of obtaining an abstract representation 
where the set of routers (ASs) and their physical connections (peering relations) are the vertices 
and edges of a graph, respectively. In the absence of accurate maps, researchers rely on a
general strategy that consists in acquiring local views of the network from several vantage points 
and merging these views. Such local views are obtained by evaluating a certain number of paths 
to different destinations, through the use of probes or the analysis of routing tables, which we will 
refer to generically in this paper as 'traceroute-like sampling', after the quintessential example
of the well-known traceroute tool. The merging of several of these views provides a map of a 
sampling of the Internet. 

While the knowledge of basic Internet topology (i.e., nodes and links) discovered through such 
sampling is of significant value in and of itself, it is natural to also want to use the resulting sample 
maps to infer properties of the overall Internet map. With such a strategy in mind, a number of 
research groups have generated sample maps of the Internet [4, 5, 6, 7] that have then been
used for the characterization of network properties. For example, the 'small world' character of 
the Internet has thus been uncovered. Moreover, the probability that any vertex in the graph has 
degree k (i.e., that it has exactly k links joining it to immediate neighbors) has been characterized 
as being skewed and heavy-tailed, with an approximately power-law functional form [8].

Recently, the question of the accuracy of the topological characteristics inferred from such
maps has been the subject of various studies [10]. Overall, these studies suggest
that at a qualitative level the main conclusions drawn from traceroute-like samplings are
reliable. For example, it has been found that such samplings allow for accurate discrimination
between topologies with heavy-tailed degree distributions and those with homogeneous ones [11].
On the other hand, at a quantitative level the evidence suggests the possibility for considerable 
deviations between numerical summaries of characteristics of the sampled networks and those of 
the actual Internet. 

The point of departure for our contributions is the observation that the inference, from traceroute- 
like measurements, of many common measures of network graph characteristics is in fact related 
to the so-called 'species problem' in statistics. This association with the species problem has im- 
portant implications because, while the species problem is well-studied, it is also known to be a 
statistical inference problem that is often particularly difficult. Therefore, for example, in the con- 
text of Internet mapping and inference with traceroute, while it is clear that the observed number 
of nodes, links, and vertex degrees necessarily will underestimate the actual Internet values, it turns 
out that the accurate adjustment of the observed values may be nontrivial. Furthermore, the unique 
nature of traceroute-like sampling procedures means that standard tools for species estimation 
are unlikely to be immediately applicable. 

This paper is organized as follows. We provide general background on traceroute and the
species problem in Section 2. We then focus on what is arguably the most fundamental species
problem in the context of traceroute-like measurements: inferring the number of nodes in a
network. In Section 3 we present an analytical argument characterizing the structural elements
relevant to this estimation problem. In Section 4 we propose two estimators, derived from
principles of statistical subsampling. In Section 5 we describe the results of an extensive
numerical evaluation of these estimators. Finally, Section 6 contains some additional discussions
and directions for future work.



2 Background 



Throughout this paper we will represent an arbitrary network of interest as an undirected,
connected graph $Q = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is a set of vertices (nodes) and
$\mathcal{E}$ is a set of edges (links). Denote by $N = |\mathcal{V}|$ and $M = |\mathcal{E}|$ the numbers of
vertices and edges, respectively. In a typical traceroute study, a set $S = \{s_1, \ldots, s_{n_s}\}$
of $n_s$ active sources deployed in the network sends probes to a set $T = \{t_1, \ldots, t_{n_T}\}$ of
$n_T$ destinations (or targets), for $S, T \subset \mathcal{V}$. Each probe collects information on all the
vertices and edges traversed along the path connecting a source to a destination [15]. The actual
paths followed by the probes depend on many different factors, such as commercial agreements,
traffic congestion, and administrative routing policies, but to a first approximation are often
thought of (and frequently modeled as) 'shortest' paths. The merging of the various sampled paths
yields a partial map of the network (Fig. 1). This map may in turn be represented as a sampled
subgraph $Q^* = (\mathcal{V}^*, \mathcal{E}^*)$.

Numerous metrics are used in networking (and indeed across the network-oriented sciences 
more generally) to summarize characteristics of a network graph Q. Some of the most fundamental 
metrics include the number of vertices, $N$, the number of edges, $M$, and the degrees $\{k_i\}$ of
vertices $i \in \mathcal{V}$. Many other metrics either may be expressed as explicit functions of these or
have closely related behavior. For an arbitrary metric, say $\eta = \eta(Q)$, summarizing some
characteristic of $Q$, and a traceroute-sampled graph $Q^*$, it is natural to wish to produce an
estimate, say $\hat{\eta}$, from the measurements underlying $Q^*$. However, some caution is in order, in
that for the quantities $N$, $M$, and $k_i$, the problem of their inference is closely related to the
so-called species problem in statistics.

Stated generically, the species problem refers to the situation in which, having observed n 
members of a (finite or infinite) population, each of whom falls into one of C distinct classes 
(or 'species'), an estimate $\hat{C}$ of $C$ is desired. This problem arises in numerous contexts, such as
numismatics (e.g., how many of an ancient coin were minted [16]), linguistics (e.g., what was the
size of an author's apparent vocabulary [17, 18]), and biology (e.g., how many species of animals
inhabit a given region).

The species problem has received a good deal of attention in statistics. See [19] for an overview
and an extensive bibliography. Perhaps surprisingly, however, while the estimation of the relative
frequencies of species in a population is well-understood (given knowledge of $C$), the estimation
of C itself is often difficult. In essence, what is needed is to estimate the number of species not 
observed. This task is problematic due to the fact that it is precisely the species present in relatively 
low proportions in the population that are expected to be missed, and there could be an arbitrarily 
large number of such species in arbitrarily low proportions. Despite (or perhaps because of) the 
difficulty of the problem, numerous methods have been proposed for its solution, differing mainly 
in the assumptions regarding the nature of the population, the type of sampling involved, and the 
statistical machinery used. 
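
To make the difficulty concrete, the following minimal sketch draws n observations from a
hypothetical population with a few common classes and many rare ones, and reports how many
distinct classes are seen. It is an illustration only and is not taken from the paper: the
population sizes, the weights, and the function name `observed_classes` are all assumptions.

```python
import random

def observed_classes(weights, n, seed=0):
    """Draw n members with class probabilities proportional to `weights`
    and return the number of distinct classes observed."""
    rng = random.Random(seed)
    draws = rng.choices(range(len(weights)), weights=weights, k=n)
    return len(set(draws))

# Hypothetical population: 10 common classes and 990 rare ones (C = 1000).
weights = [100.0] * 10 + [0.1] * 990
C = len(weights)
for n in (100, 1000, 10000):
    seen = observed_classes(weights, n)
    print(f"n={n:6d}  observed {seen:4d} of C={C} classes")
```

In such a population the observed count stays far below C for moderate n, and the sample by
itself says little about how many rare classes remain unseen.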

An understanding of the implications of the species problem on network topology inference 
is of critical importance. For example, we note that in traceroute-like sampling the problem of 
estimating the number of vertices N in a network graph Q may be mapped to a species problem 
by considering each separate vertex i as a 'species' and declaring a 'member' of the species i to 
have been observed each time that $i$ is encountered on one of the $n = n_s n_T$ traceroute paths.
A similar argument shows that estimation of the number of edges $M$ too may be mapped to a
species problem. Finally, as in [20], the problem of inferring the degree $k_i$ of a vertex $i$ from
traceroute measurements can also be mapped to the species problem, by letting all edges incident
to $i$ constitute a species and declaring a member of that species to have been observed every time
one of those edges is encountered. Because the values $N$, $M$, and $\{k_i\}_{i \in \mathcal{V}}$ are both important in
their own right and bear important relations to other metrics of interest, it is logical to focus upon 
the question of their inference. In this paper, we concentrate on the inference of the first of these 
quantities, N. 
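
The mapping of vertices to 'species' can be sketched in a few lines. The code below is a toy
illustration under stated assumptions: paths are given as plain lists of vertex labels, the
labels and the helper names `vertex_abundances` and `frequency_of_frequencies` are hypothetical,
and a vertex is counted at most once per path.

```python
from collections import Counter

def vertex_abundances(paths):
    """Count, for each vertex, the number of paths on which it appears.

    Each vertex plays the role of a 'species'; each appearance on a path
    is one observed 'member' of that species.
    """
    counts = Counter()
    for path in paths:
        counts.update(set(path))  # count a vertex at most once per path
    return counts

def frequency_of_frequencies(counts):
    """Return f, where f[k] is the number of vertices seen on exactly k paths."""
    return Counter(counts.values())

# Toy example: n_s = 1 source ("s") and n_T = 3 targets, so n = 3 paths.
paths = [
    ["s", "a", "b", "t1"],
    ["s", "a", "c", "t2"],
    ["s", "d", "t3"],
]
counts = vertex_abundances(paths)
freqs = frequency_of_frequencies(counts)
print("observed vertices N* =", len(counts))
print("vertices seen on exactly one path =", freqs[1])
```

The frequency-of-frequencies table is the usual starting point of classical species estimators;
vertices that would have appeared on zero paths are exactly the unobserved part of N.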

3 Inferring N: Characterization of the Problem 

Before proceeding to the construction of estimators for $N$, as we will do in Section 4, it is useful to
first better understand the structural elements of the problem. In particular, the following analysis
provides insight into the structure of the underlying 'population', the relative frequency of the
various 'species', and the impact of these factors on the problem of inferring $N$. For the sake of
exposition, in this section we adopt the common convention of modeling Internet routing, to a first
approximation, as 'shortest-path' routing. However, we hasten to note that such an assumption,
or even an assumption of a static routing protocol, is nowhere made in the derivation of the
estimators in Section 4.

A crucial quantity in the characterization of traceroute-like sampling is the so-called
betweenness centrality, which essentially counts for each vertex the number of shortest paths on
which it lies: nodes with large betweenness lie on many shortest paths and are thus more easily and
more frequently probed [12]. More precisely, if $\mathcal{D}_{hj}$ is the total number of shortest paths from
vertex $h$ to vertex $j$, and $\mathcal{D}_{hj}(i)$ is the number of these shortest paths that pass through the
vertex $i$, the betweenness of the vertex $i$ is defined as $b_i = \sum \mathcal{D}_{hj}(i)/\mathcal{D}_{hj}$, where the sum
runs over all $h, j$ pairs with $j \neq h \neq i$ (and $i \neq j$). It can be shown [21] that the average
shortest path length between pairs of vertices, $\ell$, is related to the betweenness centralities
through the expression

$$\sum_i b_i = N(N-1)(\ell - 1).$$

This may be rewritten in the form

$$N = 1 + \frac{E[b]}{\ell - 1}, \tag{1}$$

where the expectation $E[\cdot]$ is with respect to the distribution of betweenness across nodes in the
network, i.e., $P(b) = \#\{i \in \mathcal{V} : b_i = b\}/N$.
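
The relation between total betweenness and average path length can be checked numerically on a
small graph. The sketch below is a brute-force verification, not an efficient algorithm; it assumes
an undirected, connected graph given as an adjacency dictionary, sums over ordered pairs (h, j),
and excludes both endpoints from the betweenness count. The toy graph and the function names
are hypothetical.

```python
from collections import deque

def bfs_counts(adj, src):
    """Return (dist, sigma): shortest-path lengths from src and the
    number of shortest paths from src to each reachable vertex."""
    dist = {src: 0}
    sigma = {src: 1}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]
    return dist, sigma

def betweenness_and_mean_length(adj):
    """Betweenness over ordered pairs (h, j), h != j, with i distinct from
    both endpoints, together with the mean shortest-path length."""
    nodes = list(adj)
    info = {v: bfs_counts(adj, v) for v in nodes}
    b = {v: 0.0 for v in nodes}
    total_len = 0
    for h in nodes:
        dist_h, sigma_h = info[h]
        for j in nodes:
            if j == h:
                continue
            total_len += dist_h[j]
            for i in nodes:
                if i == h or i == j:
                    continue
                dist_i, sigma_i = info[i]
                # i lies on a shortest h-j path iff the distances add up
                if dist_h[i] + dist_i[j] == dist_h[j]:
                    b[i] += sigma_h[i] * sigma_i[j] / sigma_h[j]
    n = len(nodes)
    return b, total_len / (n * (n - 1))

# Toy graph: a 4-cycle (0-1-2-3) with a pendant vertex 4 attached to 2.
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3, 4], 3: [0, 2], 4: [2]}
b, ell = betweenness_and_mean_length(adj)
n = len(adj)
print("sum of betweenness :", sum(b.values()))
print("N (N-1) (ell - 1)  :", n * (n - 1) * (ell - 1))
print("N recovered via (1):", 1 + (sum(b.values()) / n) / (ell - 1))
```

Since every shortest path of length d has d - 1 interior vertices, the two printed totals agree
up to floating-point rounding, and the last line recovers N from the mean betweenness and ell.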

Empirical experiments suggest that the average shortest path length $\ell$ can be estimated quite
accurately, which is not surprising given the path-based nature of traceroute. Therefore, the
problem of estimating $N$ is essentially equivalent to that of estimating the average betweenness
centrality. Motivated by the fact that Internet maps have been found to display a broad distribution
of not only degrees, but also betweenness [12], let us consider a model that pictures the distribution
of the betweenness as divided into two parts. That is, we model the distribution $P(b)$ as a mixture
distribution

$$P(b) = \pi P_1(b) + (1 - \pi) P_2(b), \tag{2}$$

where $P_1$ is a distribution at low values $b \in [1, b_{\min})$, for some $b_{\min}$ small, and $P_2(b)$ is a
distribution at high values $b \in [b_{\min}, b_{\max}]$, $b_{\max} \gg b_{\min}$.

The average $E[b]$ in (1) is a weighted combination of two terms, i.e., $E[b] = \pi E_1[b] + (1 - \pi) E_2[b]$.
From the perspective of the simple parametric model just described, the challenge of accurately
estimating $E[b]$, and hence $N$, can be viewed as a problem of the accurate estimation of the two
means, $E_1[b]$ and $E_2[b]$, and the weight $\pi$. Unfortunately, the first mean, $E_1[b]$, requires
knowledge of the betweenness of vertices with "small" betweenness, that is, of nodes $i \in \mathcal{V}$
traversed by relatively few paths. But these are precisely the nodes on which we receive the
least information from traceroute-like studies, as they are expected to be visited infrequently
or not at all. And the relative proportion $\pi$ of such nodes would seem to be similarly difficult to
determine. As mentioned earlier, this is a hallmark characteristic of the species problem, i.e., the
lack of accurate knowledge of the relative number in the population of comparatively infrequently
observed species.

As for the second mean, $E_2[b]$, let us approximate the observed broad distribution of betweenness
in the tail by a heavy-tailed power-law form, i.e., $P_2(b) = b^{-\beta}/K$, where $K$ is a normalization
constant. Then

$$E_2[b] = \frac{1}{K} \int_{b_{\min}}^{b_{\max}} b^{1-\beta}\, db. \tag{3}$$

A simple calculation yields $K = (b_{\max}^{1-\beta} - b_{\min}^{1-\beta})/(1 - \beta)$. Additionally, if the only origin of
the cutoff is the finite size of the network, $b_{\max}$ can be defined by imposing the condition that the
expected number of nodes beyond the cut-off is bounded by a fixed constant. Therefore one finds that

$$N \times \int_{b_{\max}}^{\infty} P(b)\, db \sim 1 \;\Longrightarrow\; b_{\max} \sim b_{\min} \left[(1 - \pi) N\right]^{1/(\beta - 1)}, \tag{4}$$

i.e., a relation between $b_{\max}$, $b_{\min}$, $N$, $\pi$ and $\beta$, in which we have also used the assumption
$b_{\min} \ll b_{\max}$, which implies $K \approx b_{\min}^{1-\beta}/(\beta - 1)$.

Our empirical studies indicate that the exponent $\beta$ can be estimated fairly accurately from
the distribution of betweenness values observed through traceroute measurements. And the above
calculations suggest that knowledge of $\beta$ is key to knowledge of $K$. However, note from (4) that
$b_{\max}$ involves not only the unknown $N$, as would be expected, but also $\pi$, which suggests that even
the inference of the $E_2$ component of $E[b]$ is potentially impacted by our ability (or lack thereof) to
recover information on nodes with low betweenness. Furthermore, we mention that our numerical
studies show that $\beta$ is in fact likely quite close to 2 in the real Internet, which suggests an additional
level of subtlety in the accurate estimation of $E_2$, due to the nature of the integral in (3).
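
The sensitivity of the tail mean to the exponent can be explored numerically. The sketch below
evaluates the cutoff relation and the tail mean in closed form, treating the exponent -1 (which
arises in the numerator integral when the power-law exponent equals 2) as a logarithm. The
parameter values are arbitrary illustrations, not estimates from the paper, and the function
names are hypothetical.

```python
import math

def b_max(b_min, beta, pi, N):
    """Cutoff relation: b_max ~ b_min * ((1 - pi) * N) ** (1 / (beta - 1))."""
    return b_min * ((1.0 - pi) * N) ** (1.0 / (beta - 1.0))

def power_integral(lo, hi, exponent):
    """Integral of b**exponent over [lo, hi], with the logarithmic case handled."""
    if abs(exponent + 1.0) < 1e-12:
        return math.log(hi / lo)
    p = exponent + 1.0
    return (hi ** p - lo ** p) / p

def tail_mean(b_min, bmax, beta):
    """Mean of b under the density b**(-beta) / K on [b_min, bmax]."""
    K = power_integral(b_min, bmax, -beta)  # normalization constant
    return power_integral(b_min, bmax, 1.0 - beta) / K

# Illustrative values only: b_min, pi and N are assumptions.
b_min, pi, N = 10.0, 0.5, 10**5
for beta in (1.8, 2.0, 2.2):
    bm = b_max(b_min, beta, pi, N)
    print(f"beta={beta:.1f}  b_max={bm:.3g}  tail mean={tail_mean(b_min, bm, beta):.3g}")
```

Near an exponent of 2 the numerator integral switches between power-law and logarithmic growth
in the cutoff, so small errors in the exponent or in the cutoff can shift the tail mean noticeably.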

The above analysis both highlights the relevant aspects of the species problem inherent in
estimating $N$ and indicates the futility of attempting a classical parametric estimation approach.
One is led, therefore, to consider nonparametric methods, in which models with a small, fixed 
number of parameters are eschewed in favor of models that essentially have as many parameters 
as data. 

From the perspective of classical nonparametric species models in statistics, the estimation of
the total number of vertices $N$, the total number of edges $M$, and the node degree $k_i$ are all
nonstandard statistical inference problems. Consider the classical idealized model where the observed
frequencies for different species are truncated Poisson variables, conditional on their positivity.
Suppose the Poisson intensities for all the species (including unobserved ones) form a random
sample from a completely unknown distribution. Then it is known that in this nonparametric Poisson
mixture model, the estimation of the total intensity of unobserved species is a well-posed problem
[23, 24], but the estimation of the total number of species is ill-posed [20] from an
information-theoretic point of view. This indicates the ill-posedness of the problems of estimating $M$ and
$k_i$ without assuming a parametric model for the distribution of the betweenness centrality, since
under Poissonized sampling, the betweenness centrality is proportional to the marginal intensity
for links, or species, in these problems. However, for the estimation of $N$, vertices are treated as
species, and they can be thought of as being first sampled with roughly equal probability as targets
and then with unequal probability as intermediate nodes in traceroute experiments. This
suggests the estimation of $N$ is more akin to that of the total intensity of unobserved species, since the
total unobserved intensity is simply the product of the number of unobserved species and the
common intensity when the species are equally likely to be included in the sample. This observation is
crucial in our derivation of the leave-one-out estimator in Section 4.2.
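
For orientation, the classical Good-Turing idea estimates the total probability of unseen species
by the fraction of observations that are singletons. The sketch below applies that textbook
formula and then inflates the observed species count by the inverse of the estimated coverage,
which is sensible only under the equal-intensity assumption discussed above. It is a generic
illustration, not the estimator derived in this paper, and the function names are hypothetical.

```python
from collections import Counter

def unseen_mass(observations):
    """Good-Turing estimate of the total probability of unseen species:
    (number of species observed exactly once) / (number of observations)."""
    counts = Counter(observations)
    n = sum(counts.values())
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / n

def coverage_adjusted_count(observations):
    """Observed species count divided by the estimated sample coverage."""
    distinct = len(set(observations))
    coverage = 1.0 - unseen_mass(observations)
    if coverage <= 0.0:
        raise ValueError("every observation is a singleton; estimated coverage is zero")
    return distinct / coverage

obs = list("aaabbcd")  # toy sample: 7 observations, 4 species, 2 singletons
print("estimated unseen mass  :", unseen_mass(obs))
print("coverage-adjusted count:", coverage_adjusted_count(obs))
```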

4 Estimators of Network Size 

A naive estimator of $N$ is simply $N^*$, the number of nodes observed in the traceroute study.
Given the levels of coverage afforded by the scale of current Internet mapping initiatives, $N^*$ can
be expected to vastly underestimate $N$ (e.g., [12]). Motivated by the results and discussion in
Section 3, in this section we develop two nonparametric estimators for $N$, using subsampling
principles.



4.1 A Resampling Estimator 

A popular method of subsampling is that of resampling, which underlies the well-known
'bootstrap' method [25]. Given a sample $x_1, \ldots, x_n$ from a population, resampling in its simplest form
means taking a second sample $x_1^*, \ldots, x_m^*$ from $x_1, \ldots, x_n$ to study a certain relationship between the
first sample and the population through the observed relationship between the second and first
samples. We utilize a similar principle here to obtain a factor by which the observed number of
vertices $N^*$ is inflated to yield an estimator $\hat{N}_{RS}$ of $N$.

Consider the quantity $N^*/N$, i.e., the fraction of nodes discovered through traceroute
sampling of $Q$, which we will call the discovery ratio. The expected discovery ratio $E[N^*/N]$ has been
found to vary smoothly as a function of the fraction $q_T = n_T/N$ of targets sampled, for a given
number $n_s$ of sources. We will use this fact, paired with an assumption of a type of scaling
relation on $Q$, to construct our estimator for $N$. Specifically, we will assume that the sampled
subgraph $Q^*$ is sufficiently representative of $Q$ so that a sampling ratio on $Q^*$ similar to that used
in obtaining $Q^*$ from $Q$ yields a discovery ratio similar to the fraction of nodes discovered in $Q$.

That is, suppose that we choose a set $S^*$ of $n_s^*$ source vertices in $Q^*$ and a set $T^*$ of $n_T^*$ target
vertices, in a manner similar to the way that the original sets $S$ and $T$ underlying $Q^*$ were chosen,
and such that $q_s^* \approx q_s$ and $q_T^* \approx q_T$, where $q_s^* = n_s^*/N^*$, $q_T^* = n_T^*/N^*$, $q_s = n_s/N$ and $q_T$ is
defined above. Then we assume that the result of a traceroute study on $Q^*$, from sources in $S^*$
to targets in $T^*$, will yield a sub-subgraph, say $Q^{**}$, of $N^{**}$ nodes, such that on average the discovery
ratio $N^{**}/N^*$ on $Q^*$ is similar to the fraction of vertices of $Q$ discovered originally through $Q^*$.
In other words, we assume that $N^*/N \approx E[N^{**}/N^* \mid Q^*]$, where the expectation $E[\,\cdot \mid Q^*]$ is with
respect to whatever random mechanism drives the choice of source and target sets $S^*$ and $T^*$ on $Q^*$,
conditional on fixed $Q^*$. Our empirical studies, using uniform random sampling on the networks
described in Section 5, suggest that this assumption is quite reasonable over a broad range of values
for $q_T$, as shown in Fig. 2.

Writing $E^*[\cdot] = E[\,\cdot \mid Q^*]$, the condition of equal discovery rates can be rewritten in the form
$N \approx N^* (N^*/E^*[N^{**}])$. The quantity $E^*[N^{**}]$ can be estimated by repeating the resampling experiment
just described some number $B$ of times, compiling sub-subgraphs $Q_1^{**}, \ldots, Q_B^{**}$ of sizes
$N_1^{**}, \ldots, N_B^{**}$, and forming the average $\bar{N}^{**} = (1/B) \sum_{r=1}^{B} N_r^{**}$. Substitution then yields

$$\hat{N}_{RS} = N^* \cdot \frac{N^*}{\bar{N}^{**}} \tag{5}$$

as a resampling-based estimator for $N$.

Note, however, that its derivation is based upon the premise that $q_s^* = q_s$ and $q_T^* = q_T$, and
$q_s, q_T$ are unknown (since $N$ is unknown). To address this issue, we first let $n_s^* = n_s$, since
typically the number of sources is too small to make $q_s$ a useful quantity. Then we note that the
expression $q_T^* = q_T$, in conjunction with our assumption on discovery rates, implies that
$n_T^*/n_T \approx E^*[N^{**}]/N^*$. With respect to the calculation of $\hat{N}_{RS}$, this fact suggests the strategy of
iteratively adjusting $n_T^*$ until the relation $n_T^*/n_T \approx \bar{N}^{**}/N^*$ holds. Alternatively, one may picture
the situation geometrically, as shown in Fig. 3. The value of $\bar{N}^{**}$ for the appropriate $n_T^*$ is then
substituted into (5) to produce $\hat{N}_{RS}$. In practice, one may either use a fixed value of $B$ throughout
or, as we have done, increase $B$ as the algorithm approaches the condition $n_T^*/n_T \approx \bar{N}^{**}/N^*$.
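
The search for the appropriate number of resampled targets can be sketched as follows. This is
a simplified illustration under several assumptions: the callback `discover` is hypothetical and
stands in for one traceroute-like study on the sampled graph, the search is an exhaustive scan
rather than the iterative adjustment described above, and B is held fixed.

```python
import math

def resampling_estimate(n_star, n_t, discover, B=20):
    """Sketch of the resampling estimator.

    n_star   : number of vertices N* in the sampled graph
    n_t      : number of targets n_T used in the original study
    discover : hypothetical callback; discover(m) runs one traceroute-like
               study on the sampled graph with m targets and returns the
               number N** of vertices found
    B        : number of resamples averaged for each candidate m
    """
    best_m, best_gap, best_mean = None, float("inf"), None
    for m in range(1, n_t + 1):
        mean_n2 = sum(discover(m) for _ in range(B)) / B
        # Look for m / n_T closest to mean(N**) / N*.
        gap = abs(m / n_t - mean_n2 / n_star)
        if gap < best_gap:
            best_m, best_gap, best_mean = m, gap, mean_n2
    # Inflate N* by the inverse of the resampled discovery ratio.
    return n_star * n_star / best_mean, best_m

# Toy stand-in for a discovery curve (an assumption, not measured data).
def toy_discover(m, n_star=1000):
    return n_star * (1.0 - math.exp(-m / 50.0))

estimate, m_star = resampling_estimate(1000, 100, toy_discover, B=3)
print(f"selected n_T* = {m_star}, estimate of N = {estimate:.1f}")
```

In a real study `discover` would draw fresh source and target sets on the sampled graph at each
call, so the averaged values are noisy and a larger B near the solution is helpful.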

4.2 A 'Leave-One-Out' Estimator 

Various other subsampling paradigms might be used to construct an estimator. A popular one
is the 'leave-one-out' strategy underlying such methods as 'jack-knifing' and
'cross-validation' [25], which amounts to subsampling $Q^*$ with $n_T^* = n_T - 1$. The same underlying
principle may be applied in a useful manner to the problem of estimating $N$, in a way that does not
require the scaling assumption underlying (5), as we now describe.

Recall that $\mathcal{V}^*$ is the set of all vertices discovered by a traceroute study, including the $n_s$
sources $S = \{s_1, \ldots, s_{n_s}\}$ and the $n_T$ targets $T = \{t_1, \ldots, t_{n_T}\}$. Our approach will be to connect $N$ to
the frequency with which individual targets $t_j$ are included in traces from the sources in $S$ to the
other targets in $T \setminus \{t_j\}$. Accordingly, let $\mathcal{V}^*_{i,j}$ be the set of vertices discovered on the path from
source $s_i$ to target $t_j$, inclusive of $s_i$ and $t_j$. Then the set of vertices discovered as a result of targets
other than a given $t_j$ can be represented as $\mathcal{V}^*_{(-j)} = \bigcup_i \bigcup_{j' \neq j} \mathcal{V}^*_{i,j'}$. Next define
$\delta_j = I\{t_j \notin \mathcal{V}^*_{(-j)}\}$ to be the indicator of the event that target $t_j$ is not 'discovered' by traces to
any other target. The total number of such targets is $X = \sum_j \delta_j$.
We will derive a relation between $X$ and $N$ through consideration of the expectation of the
former. Under an assumption of simple random sampling in selecting target nodes from $\mathcal{V}$, given
a pre-selected (either randomly or not) set of source nodes, we have

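$$\Pr\left(\delta_j = 1 \mid \mathcal{V}^*_{(-j)}\right) = \frac{N - N^*_{(-j)}}{N - n_s - n_T + 1}, \tag{6}$$

where $N^*_{(-j)} = |\mathcal{V}^*_{(-j)}|$. Note that, by symmetry, the expectation $E[N^*_{(-j)}]$ is the same for all $j$: we
denote this quantity by $E[N^*_{(-)}]$. As a result of these two facts, we may write

$$E[X] = \sum_j \frac{N - E[N^*_{(-j)}]}{N - n_s - n_T + 1} = \frac{n_T \left(N - E[N^*_{(-)}]\right)}{N - n_s - n_T + 1}, \tag{7}$$

which may be rewritten as

$$N = \frac{n_T E[N^*_{(-)}] - (n_s + n_T - 1) E[X]}{n_T - E[X]}. \tag{8}$$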

To obtain an estimator for $N$ from this expression it is necessary to estimate $E[N^*_{(-)}]$ and $E[X]$,
for which it is natural to use the unbiased estimators $\bar{N}^*_{(-)} = (1/n_T) \sum_j N^*_{(-j)}$ and $X$ itself, which
is measured during the traceroute study. However, while substitution of these quantities in the
numerator of (8) is fine, substitution of $X$ for $E[X]$ in the denominator can be problematic in the
event that $X = n_T$. Indeed, when none of the targets $t_j$ are discovered by traces to other targets,
as is possible if $q_T = n_T/N$ is small, $N$ will be estimated by infinity. A better strategy is to
estimate the quantity $1/(n_T - E[X])$ directly. Under the condition that
$N^*_{(-j)} \approx N^*_{(-j')} \approx N^*_{(-j,-j')}$, where $N^*_{(-j,-j')} = |\mathcal{V}^*_{(-j)} \cap \mathcal{V}^*_{(-j')}|$, and our assumption of simple
random sampling of target vertices, it is possible to produce an approximately unbiased estimator
of this quantity, which upon substitution yields

$$\hat{N}_{L1O} = \frac{n_T + 1}{n_T} \cdot \frac{n_T \bar{N}^*_{(-)} - (n_s + n_T - 1) X}{n_T + 1 - X}. \tag{9}$$

Formal derivation of the leave-one-out estimator $\hat{N}_{L1O}$ in (9) may be found in the appendix.

Note that even if $X = n_T$, the estimator remains well-defined. The condition that all $\mathcal{V}^*_{(-j)}$ and
their pairwise intersections have approximately the same cardinality is equivalent to saying that
the unique contribution of discovered vertices by any one or any pair of target vertices is relatively
small. For example, using data collected by the Skitter project at CAIDA [4], a fairly uniform discovery
rate of roughly 3 new nodes per new target, after the initial 200 targets, has been cited. We
have found too that a similar rate held in the empirical experiments of Section 5. Note that this
condition also implies that $N^*_{(-j)} \approx N^*$, for all $j$, which suggests replacement of $\bar{N}^*_{(-)}$ by $N^*$ in (9).
Upon doing so, and after a bit of algebra, we arrive at the approximation

$$\hat{N}_{L1O} \approx (n_s + n_T) + \frac{N^* - (n_s + n_T)}{1 - w^*}, \tag{10}$$

where $w^* = X/(n_T + 1)$, $X$ being the number of targets not discovered by traces to any other target.

In other words, $\hat{N}_{L1O}$ can be seen as counting the $n_s + n_T$ vertices in $S \cup T$ separately, and
then taking the remaining $N^* - (n_s + n_T)$ nodes that were 'discovered' by traces and adjusting
that number upward by a factor of $(1 - w^*)^{-1}$. This form is in fact analogous to that of a classical
method in the literature on species problems, due to Good [23], in which the observed number
of species is adjusted upwards by a similar factor that attempts to estimate the proportion of the
overall population for which no members of species were observed. Such estimators are typically
referred to as coverage-based estimators, and a combination of theoretical and numerical evidence
seems to suggest that they enjoy somewhat more success than most alternatives [19].
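
The approximate form of the leave-one-out estimator is simple to compute once the traced paths
are available. The sketch below assumes the paths are stored in a dictionary keyed by
(source, target) pairs and that targets are distinct from sources; the data layout, the toy paths,
and the function name are hypothetical.

```python
def leave_one_out_estimate(paths, sources, targets):
    """Approximate leave-one-out estimate of the number of vertices.

    paths   : dict mapping (source, target) -> list of vertices on the
              traced path, including both endpoints
    sources : list of source vertices
    targets : list of target vertices (assumed disjoint from sources)
    """
    n_s, n_t = len(sources), len(targets)
    discovered = set()
    for path in paths.values():
        discovered.update(path)
    n_star = len(discovered)

    # X = number of targets not seen on any trace to a different target.
    x = 0
    for t in targets:
        seen_elsewhere = any(
            t in path for (s, t2), path in paths.items() if t2 != t
        )
        if not seen_elsewhere:
            x += 1

    w = x / (n_t + 1)  # w* = X / (n_T + 1), always strictly below 1
    return (n_s + n_t) + (n_star - (n_s + n_t)) / (1.0 - w)

# Toy study: one source, three targets; t1 also lies on the trace to t2.
paths = {
    ("s", "t1"): ["s", "a", "t1"],
    ("s", "t2"): ["s", "a", "t1", "t2"],
    ("s", "t3"): ["s", "b", "t3"],
}
print(leave_one_out_estimate(paths, ["s"], ["t1", "t2", "t3"]))
```

Because the denominator uses n_T + 1 rather than n_T, the estimate stays finite even when no
target is discovered by traces to other targets.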



5 Numerical Validation 

We examined the performance of the estimators proposed in Section 4 using a methodology
similar to that of earlier studies [11]. That is, we began with known graphs $Q$ with various topological
characteristics, equipped each with an assumed routing structure, performed a traceroute-like
sampling on them, which yielded a sample graph $Q^*$, and computed the estimators $\hat{N}_{RS}$ and $\hat{N}_{L1O}$.
This process was repeated a number of times, for various choices of source and target nodes, at
each of a range of settings of the parameters $N$, $n_s$, and $n_T$. A performance comparison was then
made by comparing values of $\hat{N}/N$, for $\hat{N} = N^*$, $\hat{N}_{RS}$, and $\hat{N}_{L1O}$.



5.1 Design of the Numerical Experiments 

Three network topologies were used in our experiments, two synthetic and one based on
measurements of the real Internet. The synthetic topologies were generated according to (i) the classical
Erdős-Rényi (ER) model [29] and (ii) the network growth model of Albert and Barabási (BA) [30].
This choice of topologies allows us to examine the effects of one of the most basic distinguishing
characteristics among networks, the nature of the underlying degree distribution. In particular, the
ER model is the standard example of a class of homogeneous graphs, in which the degree
distribution $P(k)$ has small fluctuations and a well defined average degree, while the BA model is the
original example of a class of heterogeneous graphs, for which $P(k)$ is a broad distribution with
a heavy tail and large fluctuations, spanning various orders of magnitude. In our experiments, we
have used randomly generated ER and BA networks with average degree 6, and sizes $N$ ranging
from $10^3$ to $10^6$ nodes.
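
For readers who wish to reproduce graphs of this kind, the two synthetic models can be sketched
in plain Python. These are textbook constructions written for clarity, not the generators used in
the experiments: the ER routine is quadratic in the number of vertices and may return a
disconnected graph (in which case one would typically keep the largest component), and the BA
routine uses m = 3 links per new vertex to obtain an average degree close to 6.

```python
import random

def erdos_renyi(n, mean_degree, rng):
    """G(n, p) random graph with p chosen so the expected degree is mean_degree."""
    p = mean_degree / (n - 1)
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj

def barabasi_albert(n, m, rng):
    """Preferential attachment: each new vertex links to m distinct existing
    vertices chosen with probability proportional to their degree."""
    adj = {v: set() for v in range(n)}
    endpoints = []  # each vertex appears once per incident edge
    # Start from a complete graph on m + 1 vertices.
    for u in range(m + 1):
        for v in range(u + 1, m + 1):
            adj[u].add(v)
            adj[v].add(u)
            endpoints += [u, v]
    for new in range(m + 1, n):
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(endpoints))
        for v in chosen:
            adj[new].add(v)
            adj[v].add(new)
            endpoints += [new, v]
    return adj

rng = random.Random(0)
for name, g in (("ER", erdos_renyi(500, 6, rng)), ("BA", barabasi_albert(500, 3, rng))):
    degrees = [len(nb) for nb in g.values()]
    print(f"{name}: mean degree = {sum(degrees) / len(degrees):.2f}, max degree = {max(degrees)}")
```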

The ER and BA models are standard choices for experiments like ours, and useful in allowing 
one to assess the effect on a proposed methodology of a broad degree distribution, but they lack 
other important characteristics of the real Internet, such as clustering, complex hierarchies, etc. 
Therefore, we used as our third topology the Internet sample from MERCATOR [31], a graph with
N = 228,263 nodes and M = 320,149 edges. While there are newer Internet graphs, such as
those from CAIDA [4], our choice of MERCATOR is influenced by the fact that it resulted from
an attempt to obtain an exhaustive map of the Internet in 1999. The aim in presenting such
results is primarily illustrative. 

Given a graph $Q$, and a chosen set of values for $N$, $n_s$, and $n_T$, a traceroute-like study was
simulated as follows. First, a set of $n_s$ sources $S = \{s_1, \ldots, s_{n_s}\}$ was sampled uniformly at random
from $\mathcal{V}$ and a set of $n_T$ targets $T = \{t_1, \ldots, t_{n_T}\}$ was sampled uniformly at random from $\mathcal{V} \setminus S$.
Second, paths from each source to all targets were extracted from $Q$, and the merge of these paths
was returned as $Q^*$. Shortest path routing, with respect to common edge weights $w_e = 1$, was
used in collecting these simulated traceroute-like data, based on standard algorithms. Unique
shortest paths were forced by breaking ties randomly. Other choices of routing between sources
and targets, such as random shortest path and all shortest paths, have been found to lead to similar
behavior with respect to discovery rates of nodes and links. After initial determination, routes
are considered fixed, in that the route between a source $u_s \in S$ and a vertex $v \in \mathcal{V}$ is always the
same, independent of the destination target $u_t \in T$.
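
The simulation procedure just described can be sketched as follows. This is a minimal
illustration, assuming a connected, unweighted graph stored as an adjacency dictionary; ties
between equally short routes are broken at random (though not necessarily uniformly) by
shuffling neighbor lists during a breadth-first search, and each source keeps one fixed tree of
routes. The grid graph and all function names are hypothetical.

```python
import random
from collections import deque

def shortest_path_tree(adj, source, rng):
    """BFS tree from `source` with random tie-breaking; returns a parent map."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        neighbors = list(adj[u])
        rng.shuffle(neighbors)
        for v in neighbors:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return parent

def traceroute_sample(adj, n_s, n_t, rng):
    """Pick sources and targets uniformly at random and trace all routes."""
    nodes = list(adj)
    sources = rng.sample(nodes, n_s)
    source_set = set(sources)
    targets = rng.sample([v for v in nodes if v not in source_set], n_t)
    paths = {}
    for s in sources:
        parent = shortest_path_tree(adj, s, rng)  # fixed routes for this source
        for t in targets:
            path = [t]
            while parent[path[-1]] is not None:
                path.append(parent[path[-1]])
            paths[(s, t)] = path[::-1]  # ordered from source to target
    return sources, targets, paths

def merge_paths(paths):
    """Merge traced paths into the sampled vertex and edge sets."""
    nodes, edges = set(), set()
    for path in paths.values():
        nodes.update(path)
        edges.update(frozenset(e) for e in zip(path, path[1:]))
    return nodes, edges

# Toy run on a 6 x 6 grid graph (an arbitrary stand-in for the true network).
side = 6
grid = {(r, c): [] for r in range(side) for c in range(side)}
for (r, c) in grid:
    for (dr, dc) in ((1, 0), (0, 1)):
        nb = (r + dr, c + dc)
        if nb in grid:
            grid[(r, c)].append(nb)
            grid[nb].append((r, c))
rng = random.Random(0)
sources, targets, paths = traceroute_sample(grid, 2, 5, rng)
nodes, edges = merge_paths(paths)
print(f"N = {len(grid)}, N* = {len(nodes)}, M* = {len(edges)}")
```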

We note that the routing model used here is chosen simply as a first approximation to that
in the real Internet, and emphasize that the estimators proposed in Section 4 are not derived in
a manner that makes any explicit use of these routing assumptions. This model has been used
in a number of recent papers [10, 12, 13, 14] and, although it does not account for all realistic
subtleties, we have found, as in previous studies, that it appears to be sufficient for studying the
essence of the issues at hand regarding inferences of Internet topology 'species'. Further studies
could incorporate refinements of the model, such as the ones proposed in [12].

5.2 Results 

The plots in Fig. 4 show a comparison of $N^*/N$, $\hat{N}_{RS}/N$, and $\hat{N}_{L1O}/N$, for $n_s = 1$, 10, and 100
sources, as a function of $q_T$. A value of 1 for these ratios is desired, and it is clear that for
both the resampling and the "leave-one-out" estimators the improvement over the "trivial"
estimator $N^*$ is substantial. Increasing either the number of sources $n_s$ or the density of targets
$q_T$ yields better results, even for $N^*$, but the estimators we propose converge much faster than $N^*$
towards values close to the true size $N$.

Between the resampling and the "leave-one-out" estimator, the latter appears to perform much
better. For example, we note that while both estimators suffer from a downward bias for very
low values of $q_T$, this bias persists into the moderate and, in some cases, even high range for
the resampling estimator. This is probably due to the fact that the basic hypothesis of scaling
underlying the derivation of $\hat{N}_{RS}$ is only approximately satisfied, while for $\hat{N}_{L1O}$, the underlying
hypotheses are indeed well satisfied. Notice, however, that the "leave-one-out" estimator has a
larger variability at small values of $q_T$, while that of the resampling estimator is fairly constant
throughout. This is because the same number $B$ of resamples is used in calculating $\hat{N}_{RS}$ in equation
(5), and the uncertainty can be expected to scale similarly, but in calculating $\hat{N}_{L1O}$ in equation (10),
the uncertainty will scale with $n_T$ (and hence $q_T$).

In terms of topology, estimation of $N$ appears to be easiest for the ER model: even $N^*$ is 
more accurate, i.e., the discovery rate is higher. Estimation on the MERCATOR graph appears 
to be the hardest, although interestingly, the performance of the "leave-one-out" estimator seems 
to be approximately a function of $N^*/N$ and $n_T$ and thus quite stable in all three graphs. The 
MERCATOR graph has a much higher proportion of low-degree vertices than the two synthetic 
graphs; such vertices have particularly small betweenness (and thus lie on very few shortest 
paths) and are very difficult to discover. As a side note, the resampling 
estimator behaves in a rather curious, non-monotonic fashion in two of the plots as $q_T$ grows. At 
the moment, we do not have a reasonable explanation for this behavior, although we note that it 
appears to be confined to the case of the BA graph and that some indication of this behavior can 
already be seen for this graph in Fig. 2. 

In Fig. 5 we investigate, at fixed $n_s$ and $q_T$, the effect of the real size of the graph $N$. Interestingly, 
the estimators perform better for larger sizes, while $N^*/N$ on the contrary decreases. This is 
due to the fact that the sample graph $G^*$ gets bigger, providing more and richer information, even 
if the discovery ratio does not grow. The odd nature of the results for the BA graph comes from 
the peak associated with the resampling estimator mentioned earlier; see Fig. 4. At a fixed number 
$n_T$ of targets, however, the quality of the estimators $N_{RS}$ and $N_{L1O}$ gets worse as $N$ increases, as 
shown in Fig. 6. 

6 Discussion 

In this paper, we have investigated the problem of inferring a network's properties from traceroute-like 
measurements in the framework of the so-called 'species' problem. As a first example of application, 
we have focused on the issue of estimating the real size $N$ of a network from only the 
knowledge of the sampled graph. Despite the fact that species problems often can be quite difficult, 
in this case we find it is possible to propose an estimator that, based on our empirical studies, 
works quite well, even at quite low sampling densities. 

While the present study provides a first promising step that clearly illustrates the relevance of 
the species problem in Internet inference, numerous issues remain to be explored, even simply in 
the case of estimation of N alone. For example, the proposed estimators could be evaluated with 
other types of networks. Similarly, one could examine the effect of non-random source placement 
(e.g., restricted only to the fringe of the network), as well as that of more realistic traceroute 
models [32]. However, in the case of these latter two changes, we would not expect the 'leave-one-out' 
estimator to suffer much in performance, since its derivation assumes only uniform random 
choice of targets, and not sources, and furthermore makes no explicit assumptions about routing. 
The effect of including in $G^*$ the paths from each source to the other sources should also 
be investigated. 

Our results showed that the 'leave-one-out' estimator performed noticeably better than the 
resampling estimator. Nevertheless, the resampling estimator should not be summarily dismissed 
quite yet. In particular, while the derivation of the 'leave-one-out' estimator is quite specific to the 
problem of estimating N, the derivation of the resampling estimator is general and independent 
of what is to be estimated. Initial experiments indicate that in estimating the number of edges $M$ 
in a network, for example, a resampling estimator yields similar improvements over the observed 
value $M^* = |E^*|$ as seen in estimating $N$. On the other hand, it is not immediately apparent how the 
leave-one-out principle might be applied to estimating $M$, as it is nodes (i.e., targets) and not edges 
that are chosen at the start of a traceroute study. 

Finally, it is worth recalling the broader issue raised by this paper: the fact that the problem of 
estimating a characteristic $\eta(G)$ of a network graph $G$, based on a sampled subgraph $G^*$, is as yet 
poorly understood. We have taken the case of $\eta(G) = N$ as a prototype to explore and illustrate. 
However, for this case alone there are natural alternatives that one might consider. For example, 
an experiment that used ping to test for the response of some sufficient number $n$ of randomly 
chosen IP addresses could yield an estimator $a$ of the fraction of 'alive' addresses and, in turn, 
an estimator $N_{ping} = 2^{32} a$ that is much simpler than either of those proposed in this paper. We 
have in fact performed such an experiment, with $n = 3,726,773$ pings sent from a single source, 
yielding 61,246 valid responses (for a 1.64% response rate), and resulting in an estimate $N_{ping} = 
70,583,737$. We then performed a traceroute study from the same source to the 61,216 unique 
IP addresses, and calculated a 'leave-one-out' estimate on the resulting $G^*$ of $N_{L1O} = 72,296,221$. 
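
The arithmetic behind the ping-based estimate can be checked in a few lines of Python. This is only a sketch that plugs the counts quoted above into $N_{ping} = 2^{32} a$; the variable names are illustrative and the code is not the measurement tool itself.

```python
ADDRESS_SPACE = 2**32      # size of the IPv4 address space
n_probes = 3_726_773       # pings sent from the single source
n_responses = 61_246       # valid responses received

# Estimated fraction of 'alive' addresses, and the resulting size estimate.
alive_fraction = n_responses / n_probes
n_ping = ADDRESS_SPACE * alive_fraction

print(f"response rate: {100 * alive_fraction:.2f}%")
print(f"N_ping estimate: {round(n_ping):,}")
```

Run as is, this reproduces the 1.64% response rate and the estimate of 70,583,737 reported above.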

Of course, neither of these numbers is intended to be taken too seriously in and of itself. 
The point is that, while the estimator from traceroute data is arguably less intuitive and direct 
in its derivation than that from the ping data, for the particular task of estimating $N$, it nonetheless 
produces essentially the same number. And, most importantly, while the ping data would of course 
not be useful for estimating $M$ or degree characteristics, for example, the use of traceroute 
measurements, which produce an entire sampled subgraph $G^*$, does in principle allow for the 
estimation of either of these quantities. The success of the 'leave-one-out' estimator therefore 
demonstrates both the importance and the promise of a 'species'-like perspective in the estimation 
of Internet characteristics. 

Appendix 

We derive here the estimator $N_{L1O}$. Starting from the main-text equation that expresses $N$ through 
$E[N^*_{(-j)}]$ and $E[X]$, and substituting $N^*$ and $X$ in the numerator for $E[N^*_{(-j)}]$ and $E[X]$, 
respectively, our task reduces to deriving an 
estimator of $(n_T - E[X])^{-1} = (n_T q)^{-1}$, where $q = 1 - (E[X]/n_T)$. 

Recall that $X = \sum_j \delta_j$ is the sum of $n_T$ Bernoulli (i.e., 0 or 1) random variables. If the $\delta_j$ were 
independent and identically distributed (i.i.d.), with $\Pr(\delta_j = 1) = p$, then $X$ would be a binomial 
random variable, with parameters $n_T$ and $p$, so that $q = 1 - p$. In this case, the relation 
$q E[(n_T + 1)/(n_T + 1 - X)] = 1 - p^{n_T+1}$ holds, from which it follows that the quantity 
$(n_T + 1)/(n_T + 1 - X)$ has expectation 
$q^{-1}(1 - p^{n_T+1}) \approx q^{-1}$, and therefore is an approximately unbiased estimator of $q^{-1}$. Substitution of 
the quantity $(n_T + 1)/[n_T (n_T + 1 - X)]$ for $(n_T q)^{-1} = (n_T - E[X])^{-1}$ in the starting equation then 
completes the derivation of $N_{L1O}$. 
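
The binomial relation used above follows from the identity $\binom{n}{k}(n+1)/(n+1-k) = \binom{n+1}{k}$, and it can be checked exactly with rational arithmetic. The Python sketch below does so for arbitrarily chosen illustrative values of $n$ and $p$; the helper name `expected_inverse_gap` is hypothetical.

```python
from fractions import Fraction
from math import comb


def expected_inverse_gap(n, p):
    """Exact E[(n + 1) / (n + 1 - X)] for X ~ Binomial(n, p), with p a Fraction."""
    q = 1 - p
    return sum(
        comb(n, k) * p**k * q**(n - k) * Fraction(n + 1, n + 1 - k)
        for k in range(n + 1)
    )


# Illustrative values only; any 0 < p < 1 and n >= 1 should behave the same way.
n, p = 20, Fraction(3, 10)
q = 1 - p
lhs = q * expected_inverse_gap(n, p)
rhs = 1 - p ** (n + 1)
print(lhs == rhs)  # exact equality of rationals, no floating-point tolerance
```

Because $p^{n+1}$ is negligible unless $p$ is close to 1, the expectation of $(n + 1)/(n + 1 - X)$ is essentially $q^{-1}$, which is the sense in which the estimator is approximately unbiased.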

Of course, the variables $\delta_j$ are not precisely i.i.d., due to the commonality of sources and targets 
underlying the definition of the sets $V^*_{(-j)}$. However, the $\delta_j$ share the same marginal distribution 
(i.e., with $p = (N - E[N^*_{(-j)}])/(N - n_s - n_T + 1)$), and it may be argued that they are pairwise 
nearly independent under the condition $N^* \approx N^*_{(-j)} \approx N^*_{(-j,-j')}$. These two facts together suggest 
that a binomial approximation to the distribution of $X$ should be quite accurate. It remains to argue 
for the latter fact, for which it is sufficient to show that 
$\Pr(\delta_j = 0 \mid \delta_{j'} = 0) \approx \Pr(\delta_j = 0) = 1 - p$. 
By conditioning on the sets $V^*_{(-j)}$ and $V^*_{(-j')}$, counting arguments similar to those used 
earlier in the paper yield that 
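$$
\Pr(\delta_j = 0 \mid \delta_{j'} = 0)
= E\left[ \frac{N^*_{(-j)} - 1}{N - n_s - n_T + 1} \cdot \frac{N^*_{(-j,-j')}}{N^*_{(-j')}}
+ \frac{N^*_{(-j)}}{N - n_s - n_T + 1} \cdot \frac{N^*_{(-j')} - N^*_{(-j,-j')}}{N^*_{(-j')}} \right]
\approx E\left[ \frac{N^* - 1}{N - n_s - n_T + 1} \cdot 1 + \frac{N^*}{N - n_s - n_T + 1} \cdot 0 \right]
\approx 1 - p .
$$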

Figure 1: Illustration of the traceroute-like procedure. Shortest paths between the set of sources 
and the set of destination targets are discovered (shown in full lines) while other edges are not 
found (dashed lines). In the case of degenerate shortest paths, only one is found. 

Figure 2: A comparison of the quantities $N^*/N$ and $E^*[N^{**}]/N^* = E[N^{**} \mid G^*]/N^*$, as a function 
respectively of $q_T = n_T/N$ and $q^*_T = n^*_T/N^*$, for the three networks described in Section 3. Here 
$q^*_T = q_T$ and $n_s = n^*_s = 10$. Top row shows the averages of $N^*/N$ and $E^*[N^{**}]/N^*$ over 100 
realizations of $G^*$. Bottom row shows the average of the difference of these two quantities, relative 
to $N^*/N$, over the same 100 realizations. The comparison in the top row confirms the validity 
of the scaling assumption underlying the resampling estimator, while the 
comparison in the bottom row indicates that better performance of the estimator can be expected with 
increasing $q_T$. (Note: One standard deviation error bars are smaller than the symbol size in most 
cases.) 

Figure 3: Illustration of the construction of the resampling estimator, in the case of a BA graph of 
size $N = 10^5$. The initial sampling was obtained with $n_s = 10$ sources and $n_T = 10^4$ targets 
($q_T = 0.1$), yielding a graph $G^*$ of size $N^* = 33178$ (and $M^* = 133344$). The circles show the 
ratio of the average size of the resampled graph $G^{**}$, $N^{**}/N^*$, as a function of the ratio $n^*_T/n_T$, with 
$n^*_s = n_s = 10$ sources. The error bars give the variance with respect to the various placements of 
sources and targets used for the resampling. The straight line is $y = x$ and allows one to find the value 
of $n^*_T$ such that $n_T/n^*_T = N^*/N^{**}$. 

Figure 4: Comparison of the various estimators for the BA (top), ER (middle) and Mercator (bottom) 
networks. The curves show the ratios of the various estimators to the true network size, as 
a function of the target density $q_T$. Full circles: $N_{L1O}/N$; Empty squares: $N_{RS}/N$; Stars: $N^*/N$. 
Values and one standard deviation error bars are based on 100 trials, with random choice of sources 
and targets for each trial. Left figures: $n_s = 1$ source; Middle: $n_s = 10$ sources; Right: $n_s = 100$ 
sources. 

Figure 5: Effect of the size $N$ of the graph $G$ for BA and ER graphs at constant number of sources 
and density of targets. The curves show the ratios of the various estimators to the true network 
size, as a function of the graph size $N$. Full circles: $N_{L1O}/N$; Empty squares: $N_{RS}/N$; Stars: $N^*/N$. 
Values and one standard deviation error bars are based on 100 trials, with random choice of sources 
and targets for each trial. $n_s = 10$. Left figures: $q_T = 10^{-3}$; Middle: $q_T = 10^{-2}$; Right: $q_T = 10^{-1}$. 

Figure 6: Effect of the size $N$ of the graph $G$ for BA and ER graphs at constant number of sources 
and targets. The curves show the ratios of the various estimators to the true network size, as a 
function of the graph size $N$. Full circles: $N_{L1O}/N$; Empty squares: $N_{RS}/N$; Stars: $N^*/N$. Values 
and one standard deviation error bars are based on 100 trials, with random choice of sources and 
targets for each trial. $n_s = 10$. Left figures: $n_T = 10^2$ targets; Middle: $n_T = 10^3$ targets; Right: 
$n_T = 10^4$ targets. 